A machine learning-based prediction of hospital mortality in mechanically ventilated ICU patients

Background Mechanical ventilation (MV) is vital for critically ill ICU patients but carries significant mortality risks. This study aims to develop a predictive model to estimate hospital mortality among MV patients, utilizing comprehensive health data to assist ICU physicians with early-stage alerts. Methods We developed a Machine Learning (ML) framework to predict hospital mortality in ICU patients receiving MV. Using the MIMIC-III database, we identified 25,202 eligible patients through ICD-9 codes. We employed backward elimination and the Lasso method, selecting 32 features based on clinical insights and literature. Data preprocessing included eliminating columns with over 90% missing data and using mean imputation for the remaining missing values. To address class imbalance, we used the Synthetic Minority Over-sampling Technique (SMOTE). We evaluated several ML models, including CatBoost, XGBoost, Decision Tree, Random Forest, Support Vector Machine (SVM), K-Nearest Neighbors (KNN), and Logistic Regression, using a 70/30 train-test split. The CatBoost model was chosen for its superior performance in terms of accuracy, precision, recall, F1-score, AUROC metrics, and calibration plots. Results The study involved a cohort of 25,202 patients on MV. The CatBoost model attained an AUROC of 0.862, an increase from an initial AUROC of 0.821, which was the best reported in the literature. It also demonstrated an accuracy of 0.789, an F1-score of 0.747, and better calibration, outperforming other models. These improvements are due to systematic feature selection and the robust gradient boosting architecture of CatBoost. Conclusion The preprocessing methodology significantly reduced the number of relevant features, simplifying computational processes, and identified critical features previously overlooked. Integrating these features and tuning the parameters, our model demonstrated strong generalization to unseen data. This highlights the potential of ML as a crucial tool in ICUs, enhancing resource allocation and providing more personalized interventions for MV patients.


Introduction
In the United States, over one million patients receive mechanical ventilation (MV) annually in Intensive Care Units (ICU), occupying 24-41% of ICU beds at any given time [1].Although MV is frequently considered a lifesaving intervention, patients undergoing non-surgical procedures that require MV have a hospital mortality rate exceeding 35% [2].This high mortality rate underscores the necessity of studying the relationship between MV and mortality to develop predictive models that can aid in early intervention.
Current predictive models for MV patient mortality often lack sufficient accuracy and practical applicability.Existing studies often rely on limited features or fail to address data imbalance, reducing model effectiveness.This study aims to develop an improved predictive model to estimate the mortality of MV patients using patient health data, including early-stage symptoms [3].The model is intended to assist ICU physicians in early alerting by utilizing a comprehensive database containing easily obtainable and well-generalized health data for better accuracy and prediction [4].Previous research has demonstrated the feasibility of using patient health data for predictive modeling in ICU settings, highlighting the importance of integrating such models into clinical practice [5,6].
There are many factors to consider when assessing risk for mechanically ventilated patients, including predictors available on the first day, which can be used to predict hospital mortality [7].Effective machine learning prediction requires careful feature selection, accounting for the severity of the disease and outcomes to understand early prediction factors and the impact of MV [8,9].Previous studies have shown that XGBoost performs well in predicting hospital mortality, aiding in understanding different clinical situations for early-stage alerting [10,11].However, different models may be suitable for various purposes beyond performance, considering clinical outcomes and contributing factors, which can influence mortality rates [12,13].
Most studies have not utilized advanced machine learning algorithms to their full potential or effectively addressed class imbalance issues, leaving a significant gap in the literature.This study's key advantages include: (1) employing a comprehensive feature selection process using backward elimination, the Lasso method, and expert clinical insights; (2) effectively addressing class imbalances using SMOTE; (3) leveraging advanced algorithms known for superior handling of categorical variables and computational efficiency; and (4) applying rigorous data preprocessing techniques, such as mean imputation and label encoding for categorical transformations.
From the present study, CatBoost performed better than other approaches.Clinical situations and feature selection are crucial in assessing both short-and long-term factors [14][15][16].By implementing machine learning approaches, we can better understand how these methods, combined with current healthcare data methodologies, improve the evaluation of risk factors [17,18].
We thoroughly examined the inclusion criteria for our feature selection process using the MIMIC-III database when assessing risk factors, followed by various data extraction techniques for data preprocessing.Additionally, we implemented several machine learning models, each yielding unique results and offering different perspectives on predicting the mortality rate for MV patients.

Data description
The Medical Information Mart for Intensive Care III (MIMIC-III) is a publicly accessible database [4].It contains de-identified health information from over 40,000 ICU admissions at the Beth Israel Deaconess Medical Center, covering the years 2001 to 2012 [19].Developed by the MIT Lab for Computational Physiology, MIMIC-III includes a wide range of data categories, such as demographics, vital signs, laboratory test results, medications, and mortality outcomes.This comprehensive database supports extensive research in clinical informatics.

Patient selection
We initially included patients who required mechanical ventilation (MV) in the ICU.To exclude incomplete and duplicated data, we refined the dataset according to specific inclusion criteria: (I) patients aged between 18 and 90 years; (II) patients with complete mortality information; (III) patients with sufficient clinical data, ensuring columns with fewer than 90% missing data were included.We utilized ICD-9 codes to identify relevant patient records and linked these with ventilation-related data, resulting in a final cohort of 25,202 patient samples.The flowchart of patient selection and data preprocessing is illustrated in Fig 1.

Feature selection and data preprocessing
The feature selection process in this study involved several stages.Initially, backward elimination and the Lasso method were employed to identify the most significant features [20,21].This selection was further refined through an extensive review of existing literature and clinical insights, resulting in the selection of 32 features.
The demographic data included age.Vital signs such as heart rate (HR), respiratory rate (RR), respiratory rate set (RR Set), temperature (TEMP), non-invasive blood pressure systolic (NIBP Systolic), non-invasive blood pressure diastolic (NIBP Diastolic), arterial blood pressure systolic (ABP Systolic), and arterial blood pressure diastolic (ABP Diastolic) were recorded.Additionally, laboratory values encompassing bicarbonate, serum creatinine, serum potassium concentration, and serum sodium concentration were included.Comorbidities such as liver failure, chronic heart failure, organ failure, sepsis, uncomplicated hypertension, and respiratory dysfunction were also documented.We included counts of specific diseases in our model based on literature and expert advice, which highlight their strong association with mortality in mechanically ventilated (MV) patients.This enhances our model's performance by capturing the impact of these critical conditions.Differences in vital signs, such as the differences in heart rate, respiratory rate, NIBP systolic, and temperature, were calculated.The detailed overview of feature information used in this study is presented in Table 1.
To address class imbalances in the target variables, the Synthetic Minority Over-sampling Technique (SMOTE) was applied [22].Data preprocessing involved removing columns with over 90% missing data and applying mean imputation to the remaining missing values.Categorical features were converted into numerical codes using a Label Encoder to integrate them into regression and machine learning models.These preprocessing steps were essential for standardizing the dataset, enabling efficient model training and evaluation, and ensuring that analyses accurately reflected the original measurements and categories present in the dataset.

Model development and optimization
Our final dataset included 25,202 patients with 32 features.We performed a 70/30 train-test split to facilitate model evaluation, choosing this method over cross-validation to improve performance and computational efficiency.We developed and assessed several machine learning (ML) algorithms, including Logistic Regression, CatBoost, XGBoost, Decision Tree, Random Forest, Support Vector Machine (SVM), and K-Nearest Neighbors (KNN).
The performance of these models was primarily measured using the Area Under the Receiver Operating Characteristic (AUROC) scores, with accuracy and F1 scores also calculated for a comprehensive comparison.Given the widespread use of AUROC in existing literature, it was selected as the primary metric for model evaluation [23].CatBoost emerged as the top performer, confirming its superior performance relative to other models.Introduced in 2017, CatBoost is a novel boosting method based on Gradient-Boosted Decision Trees (GBDT).It offers several advantages, including support for categorical variables, streamlined parameterization, fast prediction capabilities, and notable accuracy [24].Our study highlights the advantages of using CatBoost in hospital mortality prediction.
To provide a thorough comparison, we included six widely employed machine learning algorithms as baseline models based on their frequent selection in the literature: Decision Tree, Random Forest, SVM, KNN, Logistic Regression, and XGBoost.This selection allowed us to evaluate the effectiveness of our data processing and feature selection methods comprehensively.We aimed to demonstrate that our overall approach could yield superior results across these models.Decision trees partition the feature space hierarchically, minimizing impurity in each split [25].Random Forest, an ensemble method, aggregates multiple decision trees for prediction [26].SVM seeks an optimal hyperplane for class separation, maximizing the margin [27].KNN assigns class labels based on the majority vote among the k nearest neighbors [28].Logistic Regression estimates binary outcome probabilities using a logistic function [29].XGBoost, an accelerated gradient boosting implementation, iteratively enhances predictive accuracy.These diverse methodologies allowed us to comprehensively evaluate Cat-Boost's performance [30].
In comparing CatBoost with these baseline models, distinct differences emerged [31].Cat-Boost was included due to its distinct advantages, such as efficiently handling categorical features and mitigating overfitting through ordered boosting [32].We hypothesized that CatBoost would generate a high-quality predictive model, ultimately enabling us to identify the best-performing model for our task.CatBoost's symmetric algorithm for gradient-boosted decision trees provided significant enhancements in accuracy, robustness, and computational efficiency [33].Evaluation metrics such as accuracy, precision, recall, F1-score, AUROC metrics, and calibration plots highlighted the predictive strengths of each model across various criteria.By employing these methodologies, we developed a robust predictive model for hospital mortality in mechanically ventilated ICU patients.This model offers valuable insights, supporting clinical decision-making and optimizing resource allocation within ICUs.

Statistical analysis of models
To validate the statistical robustness of our model results, we conducted a comprehensive statistical analysis comparing the train and test sets using t-tests and chi-square tests.The dataset was split into a 70% train set and a 30% test set, which were used to evaluate the performance of trained models.A threshold p-value of 0.05 was set to determine the significance of differences [34].This analysis focused on identifying any significant discrepancies between the two datasets, ensuring the reliability of the study's findings.T-tests were employed to compare the means of continuous variables between the train and test sets.For categorical variables, chi-square tests were utilized to assess the independence between the two sets.We tested the hypothesis that there was no significant difference between the train and test sets.If the p-value for each variable was greater than 0.05, we could reject the null hypothesis and conclude that the test set had notable distinctions from the train set.This analysis was crucial for confirming the consistency and generalizability of our model.

Features importance
We analyzed each variable's impact on our model using SHAP (SHapley Additive exPlanations) values, which measure feature importance.SHAP values are particularly useful in the medical domain as they provide clear and interpretable insights into model predictions, aiding healthcare professionals in understanding the reasoning behind diagnostic decisions [35,36].SHAP values reveal the most influential features in predicting the model's output by ranking them based on their impact [37].
The graphical representation of SHAP values includes a bar plot of mean absolute SHAP values, indicating how much each feature contributes to the model's predictions.For instance, 'Age' was identified as the most significant feature, followed by 'Uncomplicated_hyperten-sion_count' and 'respiratory_dysfunction_count'.These top-ranked features had higher mean absolute SHAP values, showing their significant impact on the model's output.Lesser-impact features, such as 'Heart_Rate' and 'Bicarbonate_max_value', had lower SHAP values.The SHAP summary plot, presented in Fig 2 , visually demonstrates these findings.This analysis provides valuable insights into the internal mechanics of our machine learning models, ensuring transparency, improving model accuracy, and informing decision-making by highlighting key drivers of predictions.
This graphical representation provides a valuable tool for understanding the inner workings of complex ML models.By interpreting the average impact of each feature on model output magnitude, we gain insights into which features drive predictions the most and help ensure transparency in the predictive process.This, in turn, aids in debugging, improving model accuracy, and making informed decisions.This analysis provides valuable insights into the internal mechanics of our machine learning models, ensuring transparency, improving model accuracy, and informing decision-making by highlighting key drivers of predictions.

Cohort characteristics model completion
Following our approach for feature selection and data preprocessing for ICU patients requiring mechanical ventilation, our final dataset included 25,202 patients from the MIMIC-III database.The selected cohort was randomly split into training and testing sets with a ratio of 70/30, resulting in 17,641 patients for training and 7,561 patients for test.The training set was used to train the models, while the testing set was employed to evaluate the performance of our proposed model.To rigorously evaluate the disparity between the training and testing sets, we conducted a statistical analysis focusing on the hypothesis that the testing set exhibited notable distinctions from the training set.A threshold p-value of 0.05 was set to ascertain the significance of the differences.The analysis aimed to identify any substantial gaps between the two datasets, ensuring the reliability of the study's findings.The detailed cohort values and pvalues reflecting differences between the training and testing sets are presented in Table 2.The results indicated that the training and testing sets were largely comparable.However, one variable, 'Chronic_heart_failure_count,' reflected a significant difference between the datasets, as evidenced by its p-value.Overall, the analysis confirmed that, apart from this variable, there were no significant differences between the training and testing sets, ensuring that the model's results are reliable and generalizable.

Evaluation metrics proposed and baseline models' performance
The results for both the proposed and baseline models are summarized in Table 3.The proposed approach, using the CatBoost model, achieved an AUROC of 0.862, an accuracy score of 0.789, and an F1 score of 0.747.These metrics highlight the model's accuracy and robustness in reducing both Type I and Type II errors, validating its effectiveness for our predictive modeling tasks.The plot of the ROC curves for both the proposed and baseline models is presented in Fig 3.Among the baseline models developed using the MIMIC-III database, the Support Vector Machine (SVM) and XGBoost models performed notably well, with high AUROC, accuracy, precision, recall, and F1 scores.In contrast, the Decision Tree and K-Nearest Neighbors (KNN) models showed relatively lower performance across these metrics.Overall, the CatBoost model emerged as the top-performing model, followed closely by SVM and XGBoost.The results indicate that the proposed approach outperforms the best baseline models.
Fig 4 shows the calibration curves for seven machine learning models: CatBoost, XGBoost, SVM, Logistic Regression, Random Forest, KNN, and Decision Tree.The X-axis represents the predicted probability, and the Y-axis represents the true probability, with the diagonal line indicating perfect calibration.Observations reveal that CatBoost, XGBoost, and Logistic Regression exhibit good calibration, closely aligning with the diagonal line.SVM and Random Forest show moderate calibration, with some deviation.In contrast, KNN and Decision Tree exhibit poor calibration, slightly deviating from the diagonal line.These results suggest that CatBoost, XGBoost, and Logistic Regression are better at predicting accurate probabilities.

Summary of existing model compilation
Several models have been concurrently developed to predict mortality for ICU patients, specifically those receiving mechanical ventilation (MV).These include machine learning frameworks using various techniques and databases, with a particular focus on the MIMIC-III database due to its comprehensive nature.
Kim et al. demonstrated that the XGB model outperformed conventional scoring systems like APACHE II and ProVent in predicting 30-day mortality for mechanically ventilated patients, achieving an AUC of 0.79 [38].Mamandipoor et al. showed that an RNN-based model had higher predictive performance for respiratory disorder patients in the ICU, with an AUC of 0.75 when evaluating patients admitted with respiratory disorders, based on 50 features [39].Yu et al. developed a CatBoost model to predict in-hospital mortality for MV patients with COVID-19, achieving an AUC of 0.90.The model utilized demographic information, vital signs, and laboratory values from the ER [10].George et al. created a deep learning model to predict 3-month mortality in patients requiring more than 7 days of mechanical ventilation, achieving an AUC of 0.74, which outperformed the ProVent model (AUC 0.59) under similar conditions, using 250 features [19].
Notably, Zhu et al. utilized a wide range of features and ML methods, ultimately finding that the XGBoost model performed best with the highest AUC among their tested models.In Zhu's study, which also utilized the MIMIC-III database, data preprocessing resulted in a dataset of 25,659 patients and 55 predictors.Their features included demographics, ICU diagnosis, pre-ICU comorbidities, vital signs, disease severity scores, and laboratory test results from the first day of ICU admission.Various machine learning methods were employed, including KNN, bagging, logistic regression, decision tree, XGBoost, random forest, and neural networks [7].
In contrast to the existing literature, our study employs a more streamlined feature selection process, using only 32 features derived from backward elimination, the Lasso method, and expert clinical insights.We addressed class imbalances using the Synthetic Minority Oversampling Technique (SMOTE) and managed missing data with median imputation and categorical transformations via Label Encoder.Our proposed approach leverages the CatBoost model, known for its superior handling of categorical variables and computational efficiency.The CatBoost model in our study achieved an AUROC of 0.862, a significant improvement from the initial 0.821 AUROC score, and demonstrated notable performance in terms of accuracy (0.789) and F1-score (0.747).This is a marked improvement over Zhu et al.'s findings, which, despite using more features (55 in total), did not achieve as high an AUROC score.Additionally, the calibration plots showed that our models are well-calibrated.
The key advantages of our approach over previous studies are multifold: (1) our systematic feature selection process, combined with clinical insights, ensures that only the most pertinent features are included, reducing complexity and enhancing model accuracy; (2) the application of SMOTE effectively addresses the imbalance in target variables, which was not considered in previous studies; and (3) CatBoost's gradient boosting architecture and efficiency in handling high-dimensional data make it a superior choice for this application, outperforming traditional models in both accuracy and computational speed.

Study limitations and future research
Our study has several limitations that need to be acknowledged.Firstly, we were unable to validate our model using external datasets due to the lack of access to comprehensive databases similar to MIMIC-III.This limitation restricts our ability to confirm the generalizability and robustness of our proposed model across different patient populations and clinical settings.Future research should focus on validating our models using external datasets to ensure their applicability in diverse healthcare environments.This includes testing the model's performance metrics, such as AUROC, accuracy, and F1-score, on new data to confirm consistency and identifying any biases or limitations in different patient populations.Integrating real-time data streams and employing continuous learning models will ensure the model remains current with the latest clinical data and practices, enhancing its predictive accuracy and utility.
Secondly, the MIMIC-III database we utilized is over ten years old and lacks comprehensive historical information for many patients.This could introduce bias in our dataset selection, potentially affecting the model's performance and its relevance to current clinical practices.The reliance on an older database may not fully capture the advancements in ICU care and patient management that have occurred over the past decade.To address this limitation, future research should aim to leverage newer databases that reflect the latest clinical data and practices, thereby improving the model's accuracy and applicability.
Additionally, while our study implemented advanced feature selection and data preprocessing techniques, there is still room for improvement in handling missing data and imbalances in the dataset.Future studies could explore more sophisticated imputation methods and balancing techniques to further enhance model performance.By addressing these limitations, future research can build on our findings and contribute to more robust and generalizable predictive models for ICU patient outcomes.
To enhance the accuracy of the CatBoost predictive model, several strategies can be employed: (1) Grid search or Bayesian optimization can identify optimal parameters, improving performance, although this process can be computationally expensive and time-consuming.(2) Feature engineering, involving new features based on domain knowledge or techniques like feature interaction, can significantly boost accuracy, but it relies on domain expertise and may introduce bias if not done carefully.(3) Incorporating ensemble methods to combine predictions from multiple models can reduce overfitting and enhance predictive accuracy, yet it increases model complexity, making it harder to interpret and requiring more computational resources.Despite these limitations, these strategies collectively enhance the accuracy and robustness of the CatBoost predictive model.
Future research should explore the integration of deep learning approaches, such as Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs), which can provide significant improvements in handling complex and high-dimensional data.In this study, we used maximum, average, and minimum values of laboratory data as indicators of patient conditions.However, this approach overlooks important time series information, such as data fluctuations and trends.Incorporating time series analysis could capture these temporal patterns, enhancing the model's performance and generalizability.Time series methods can identify rapid changes in parameters often associated with high mortality rates, leading to more precise and reliable predictions.
Furthermore, it is essential to consider the clinical implementation of predictive models.Developing user-friendly tools and interfaces that can be seamlessly integrated into clinical workflows will be crucial.Engaging with clinicians during the development process to understand their needs and preferences will ensure that the models are not only accurate but also practical and easy to use in everyday clinical practice.Additionally, understanding privacy perceptions and behavior is crucial when dealing with sensitive health data, ensuring ethical standards and patient trust [40,41].This approach will ultimately enhance the adoption and impact of predictive models in improving patient outcomes in the ICU.
Lastly, future studies should investigate the potential of employing real-time data streams and continuous learning models to keep the predictive systems up-to-date with the latest clinical data and practices.This dynamic approach could significantly improve the model's responsiveness and accuracy, ensuring that it remains relevant and effective in rapidly evolving clinical environments.

Conclusion
This study significantly improves the prediction of hospital mortality for ICU patients on mechanical ventilation using advanced machine learning techniques.By applying data imputation, bootstrapping, and model optimization, we enhanced our models' predictive accuracy and robustness, as shown by the substantial improvements in AUC and accuracy metrics.
The data preprocessing methodology significantly reduced the number of relevant features, simplifying computational processes, and identified critical features previously overlooked.Integrating these features and tuning the parameters, our model demonstrated strong generalization to unseen data.
The CatBoost model, noted for its efficient handling of categorical variables and computational demands, proved particularly effective.Our streamlined feature selection process and the use of SMOTE ensured the inclusion of relevant features, reducing complexity and boosting model performance.The CatBoost model achieved an AUROC of 0.862, an accuracy of 0.789, and an F1-score of 0.747, significantly outperforming other models tested.Our streamlined feature selection process, employing backward elimination, the Lasso method, and expert clinical insights, ensured the inclusion of 32 relevant features.The use of SMOTE addressed class imbalances effectively, further boosting model performance.Additionally, our models demonstrated good calibration, ensuring that predicted probabilities closely matched actual outcomes.
The key results of this study include enhanced predictive accuracy, effective feature handling, robust feature selection, and improved calibration.These findings highlight the potential of leveraging advanced machine learning techniques to aid clinicians in making more informed decisions, ultimately improving patient outcomes in ICUs.Continued research and validation with diverse datasets will further enhance the applicability and reliability of these predictive models across different clinical settings.

Table 2 . Detailed overview of cohort characteristics for train and test cohort.
Values are presented as means with the standard deviations in parentheses. https://doi.org/10.1371/journal.pone.0309383.t002